Chapter 8 Lab 7

Objective: Welcome to Lab 7. In this lab, we are going to use real world data to perform a regression analysis exercise. We are also going to look at downloading census data using R.

  • To read in State Level census data from the tidycensus package
  • To subset to your city of choice
  • To test linear model assumptions.
  • To perform correlation analysis.
  • To examine model residuals.
  • To perform a Spatial lag model and interpret the output

Data-sets: Data from the American Community Survey.

Lab structure: Now you are getting more experienced in R, I will provide a worked example then get you to do something similar on your own data. YOUR CHALLENGE IS VERY SIMILAR TO THE TUTORIAL, so I urge you to get everything working on your machine, then edit each code chunk as necessary.

8.1 Lab 7 Set-Up

8.1.1 Create your Lab 7 project file

  • Open a new version of R-studio.
  • Click the file menu, then new project in a new directory.
  • Select your 364 directory, then name the project Lab 7.

If you are stuck on this process, see the start of previous labs. You can check this has worked by looking on the file explorer on your computer, then selecting your GEOG364 folder and checking that a Lab 7 folder has appeared with your Lab7.Proj file inside it.

8.1.2 Create your NoteBook file

Here you have a choice:

Either.. you can create a standard lab script as before:

  • Go to File/New File and create a new R-NOTEBOOK file.
  • Delete the friendly text (everything from line 6 onward)
  • Save your file as GEOG364_Lab7_PSU.ID.Rmd e.g. GEOG364_Lab7_hlg5155.Rmd
  • Follow Section 2.2.2 to modify your YAML code
  • Please make sure that your lab script has a floating table of contents (section 2.2.2, adding the toc_float: TRUE part)

Or..OPTIONAL:

If you want to explore some new formats, you can try one of the markdown templates stored in R.There are instructions on how to load them on the website here: https://github.com/juba/rmdformats/blob/master/README.md).

Again, make sure to save your file as GEOG364_Lab7_PSU.ID.Rmd. Please also add in a floating table of contents to your YAML code.

8.1.3 Style guide

A large component of the marks for your labs scripts focuses them being easily readable and easy to follow. Now you have had experience with R, here are some of the things we expect to get full marks:

  1. You include a floating table of contents, title and author in the YAML Code
  2. You include a “level 1” heading in your text for every lab challenge e.g. # Lab challenge 1
  3. It’s easy to see where your answers are - so you have full sentences, not just one word answers.
  4. You have spell-checked your work! The spell check button is between the save button and the knit/preview button.
  5. You include blank lines before and after each code chunk, or new paragraph, or bullet point set or heading (put many blank lines in markdown files and R can automatically format them correctly).
  6. Any written answers are thoughtful and well considered.

As you write your lab, regularly click the Preview or Knit Button and make sure everything looks correct and properly formatted. IF THERE ARE PROBLEMS, TALK TO AN INSTRUCTOR.

8.1.4 Download and run packages

Follow the instructions in Section 3.2.2. to download and install the following packages, plus any others that you are missing from the library code chunk.

  • corrplot
  • hrbrthemes
  • olsrr
  • plotly
  • rmapshaper
  • spatialreg
  • spatialEco
  • tidycensus
  • tidyverse
  • tigris
  • units

Now add a new code chunk in your script. Inside add the following code and run it.

Hint: make a new code chunk and copy the text below into it. Save the script. If you are missing packages, you might see a yellow bar pop up at the top of your script asking to install them - in that case just click install and it will automatically download them for you

library(corrplot)
library(hrbrthemes)
library(olsrr)
library(plotly)
library(rmapshaper) 
library(spatialreg)
library(raster)
library(spdep)
library(sp)
library(spatialEco)
library(sf)
library(tidycensus)
library(tidyverse)
library(tigris)
library(tmap)
library(units)

This needs to be at the top of your script because the library commands need to be run every time you open R. Now click the Preview or Knit Button and make sure everything looks correct. If you are not sure if there are errors, rerun the code chunk a few times, it should eventually run without any messages or output.

Don’t continue until you can make and view your html or nb.html file. If it doesn’t work, ask for help before moving on

8.1.5 Sign up for Census API

You can access census data within R, but you need to sign up in advance for a key. Do this now!

https://api.census.gov/data/key_signup.html

You can use Penn State as the organisation. This is just you promising that you will follow the terms and conditions when using this data. The system will quickly send you an e-mail you with a personal access code/key. Click the link in the e-mail to activate.

Once the e-mail arrives, make a new code chunk in your lab script and add the following code, where you replace the YOUR_KEY_HERE with the passcode they send you (in quote marks).

#tidycensus package
census_api_key("YOUR API KEY GOES HERE", install=TRUE)

Important! this seems to get angry if run mulitple times, so if you have run it once and it’s telling you that A census_api_key already exists, it’s OK to comment out this line of code.

8.2 Tutorial 1: Data Wrangling

In this tutorial I will show you how to:

  • Load census data from tidy-census
  • Load city boundary data from tigris
  • Crop a polygon shapefile to city borders
  • Load a point data-set showing the location of Chicago shops
  • Merge points with polygon data.

IF YOU HAVEN’T COMPLETED THE CODE SET-UP AND SET YOUR CENSUS KEY, STOP HERE, GO BACK AND DO IT NOW.

8.2.1 Loading census data from tidycensus

Hint: The US Census has hundreds of datasets that you can access. So in the future if there is an obscure census dataset you want to import into R, use this tutorial https://cran.r-project.org/web/packages/censusapi/vignettes/getting-started.html. Here we are going to focus on common datasets available through tidycensus, based loosely on this tutorial https://crd150.github.io

We are going to focus on installing demographic data from the American Community Survey using the tidycensus package.

The American Community Survey is a huge dataset for the US Population at Census tract scale. There are variables from population density, to demographic data, to employment and economic indicators. We don’t want to download all the data as that would be overwhelming. To see what variables are available to us, we will use the following command. This will take some time to run. You can see the full table by clicking on its name in the environment tab or using the View command

#tidycensus package
v17 <- load_variables(2017, "acs5", cache = TRUE)
head(v17)
#View(v17)

To search for specific data, select “Filter” located at the top left of this window and use the search boxes that pop up. For example, type in “population” in the box under “concept”. You should see near the top of the list the first set of variables we’ll want to download - population, with the unique ID “B01003_001”. I also searched for “income” under the label column and selected the total number of people with income $75 000 or more (ID:“B06010_011”) and the medium income (ID: "B19013_001). Alternatively, I looked them up at this website: https://www.socialexplorer.com/data/ACS2017_5yr/metadata/?ds=ACS17_5yr.

Now we have the ID for those commands, we want to download the data itself. The command/function for downloading American Community Survey (ACS) Census data is get_acs(). The command for downloading decennial Census data is get_decennial().

# Download illinois ACS data at census tract level for my chosen variables, using tidycensus
IL.Census <- suppressMessages(get_acs(geography = "tract", 
              year = 2017,
              variables = c(housevalue       = "B25075_001",  # house value
                            totp             = "B05012_001", # total population
                            tothouse         = "B25001_001", # total housing units	
                            med.income       = "B19013_001", # median income  
                            income.gt75      = "B06010_011", # number of people making > 75000 USD
                            for.born         = "B05012_003", # number of foreign born people
                            house.age        = "B25035_001", # average house age
                            month.house.cost = "B25105_001", # monthly house expenditures
                            med.gross.rent   = "B25064_001", # median rent
                            workhome         = "B08101_049", # number who work from home
                            owneroccupied    = "B25003_002", # total owner occupied
                            totalbeds        = "B25041_001", # total number of beds in the house
                            broadband        = "B28002_004"),# total with access to broadband
              state = "IL",
              survey = "acs5",
              geometry = TRUE))
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |=========                                                             |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |============                                                          |  16%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==============                                                        |  21%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  24%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  29%
  |                                                                            
  |=====================                                                 |  31%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%
# And set an appropriate UTM map projection. I found the EPSG code literally by googling Chicago UTM projection
IL.Census <- st_transform(IL.Census,26916)
IL.Census
## Simple feature collection with 40599 features and 5 fields (with 26 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 115638.5 ymin: 4093778 xmax: 457026.6 ymax: 4712668
## projected CRS:  NAD83 / UTM zone 16N
## First 10 features:
##          GEOID                                      NAME       variable
## 1  17001000202 Census Tract 2.02, Adams County, Illinois           totp
## 2  17001000202 Census Tract 2.02, Adams County, Illinois       for.born
## 3  17001000202 Census Tract 2.02, Adams County, Illinois    income.gt75
## 4  17001000202 Census Tract 2.02, Adams County, Illinois       workhome
## 5  17001000202 Census Tract 2.02, Adams County, Illinois     med.income
## 6  17001000202 Census Tract 2.02, Adams County, Illinois       tothouse
## 7  17001000202 Census Tract 2.02, Adams County, Illinois  owneroccupied
## 8  17001000202 Census Tract 2.02, Adams County, Illinois      house.age
## 9  17001000202 Census Tract 2.02, Adams County, Illinois      totalbeds
## 10 17001000202 Census Tract 2.02, Adams County, Illinois med.gross.rent
##    estimate  moe                       geometry
## 1      2473  258 MULTIPOLYGON (((124599.3 44...
## 2        21   24 MULTIPOLYGON (((124599.3 44...
## 3        92   51 MULTIPOLYGON (((124599.3 44...
## 4        45   27 MULTIPOLYGON (((124599.3 44...
## 5     42750 7229 MULTIPOLYGON (((124599.3 44...
## 6      1163   66 MULTIPOLYGON (((124599.3 44...
## 7       721   96 MULTIPOLYGON (((124599.3 44...
## 8      1949    7 MULTIPOLYGON (((124599.3 44...
## 9      1163   66 MULTIPOLYGON (((124599.3 44...
## 10      639  134 MULTIPOLYGON (((124599.3 44...
# If you just want to download a table, with no spatial attributes, set geometry to FALSE

In the above code, we specified the following arguments

  • geography: The level of geography we want the data in; in our case, the county. Other geographic options can be found here.
  • year: The end year of the data (because we want 2013-2017, we use 2017).
  • variables: The variables we want to bring in as specified in a vector you create using the function c(). Note that we created variable names of our own (e.g. “topop”) and we put the ACS IDs in quotes (“B03002_003”). Had we not done this, the variable names will come in as they are named in the ACS, which are not very descriptive.
  • state: We can filter the counties to those in a specific state. Here it is “CA” for California. If we don’t specify this, we get all counties in the United States. When we cover Census tracts in the next lab, a county filter will also be available.
  • survey: The specific Census survey were extracting data from. We want data from the 5-year American Community Survey, so we specify “acs5”. The ACS comes in 1-, 3-, and 5-year varieties.
  • geometry: If geometry is set to FALSE, it just brings in a standard table. If it’s set to true, it makes it an sf file.

Now let’s look at a summary:

IL.Census
## Simple feature collection with 40599 features and 5 fields (with 26 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 115638.5 ymin: 4093778 xmax: 457026.6 ymax: 4712668
## projected CRS:  NAD83 / UTM zone 16N
## First 10 features:
##          GEOID                                      NAME       variable
## 1  17001000202 Census Tract 2.02, Adams County, Illinois           totp
## 2  17001000202 Census Tract 2.02, Adams County, Illinois       for.born
## 3  17001000202 Census Tract 2.02, Adams County, Illinois    income.gt75
## 4  17001000202 Census Tract 2.02, Adams County, Illinois       workhome
## 5  17001000202 Census Tract 2.02, Adams County, Illinois     med.income
## 6  17001000202 Census Tract 2.02, Adams County, Illinois       tothouse
## 7  17001000202 Census Tract 2.02, Adams County, Illinois  owneroccupied
## 8  17001000202 Census Tract 2.02, Adams County, Illinois      house.age
## 9  17001000202 Census Tract 2.02, Adams County, Illinois      totalbeds
## 10 17001000202 Census Tract 2.02, Adams County, Illinois med.gross.rent
##    estimate  moe                       geometry
## 1      2473  258 MULTIPOLYGON (((124599.3 44...
## 2        21   24 MULTIPOLYGON (((124599.3 44...
## 3        92   51 MULTIPOLYGON (((124599.3 44...
## 4        45   27 MULTIPOLYGON (((124599.3 44...
## 5     42750 7229 MULTIPOLYGON (((124599.3 44...
## 6      1163   66 MULTIPOLYGON (((124599.3 44...
## 7       721   96 MULTIPOLYGON (((124599.3 44...
## 8      1949    7 MULTIPOLYGON (((124599.3 44...
## 9      1163   66 MULTIPOLYGON (((124599.3 44...
## 10      639  134 MULTIPOLYGON (((124599.3 44...

We can see we have a spatial polygons file in “long” format e.g. all the variables are stacked on top of each other in the same column, where each one also has a margin of error (the moe column). The long format isn’t ideal for our regression analysis, so let’s rearrange the data.

8.2.2 Re-arranging the data

To make this simpler, let’s remove the margin of error column:

# DON'T RUN THIS TWICE! IF YOU NEED TO RE-RUN, RUN THE ENTIRE CODE FIRST
IL.Census <- select(IL.Census, -(moe)) 
IL.Census
## Simple feature collection with 40599 features and 4 fields (with 26 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 115638.5 ymin: 4093778 xmax: 457026.6 ymax: 4712668
## projected CRS:  NAD83 / UTM zone 16N
## First 10 features:
##          GEOID                                      NAME       variable
## 1  17001000202 Census Tract 2.02, Adams County, Illinois           totp
## 2  17001000202 Census Tract 2.02, Adams County, Illinois       for.born
## 3  17001000202 Census Tract 2.02, Adams County, Illinois    income.gt75
## 4  17001000202 Census Tract 2.02, Adams County, Illinois       workhome
## 5  17001000202 Census Tract 2.02, Adams County, Illinois     med.income
## 6  17001000202 Census Tract 2.02, Adams County, Illinois       tothouse
## 7  17001000202 Census Tract 2.02, Adams County, Illinois  owneroccupied
## 8  17001000202 Census Tract 2.02, Adams County, Illinois      house.age
## 9  17001000202 Census Tract 2.02, Adams County, Illinois      totalbeds
## 10 17001000202 Census Tract 2.02, Adams County, Illinois med.gross.rent
##    estimate                       geometry
## 1      2473 MULTIPOLYGON (((124599.3 44...
## 2        21 MULTIPOLYGON (((124599.3 44...
## 3        92 MULTIPOLYGON (((124599.3 44...
## 4        45 MULTIPOLYGON (((124599.3 44...
## 5     42750 MULTIPOLYGON (((124599.3 44...
## 6      1163 MULTIPOLYGON (((124599.3 44...
## 7       721 MULTIPOLYGON (((124599.3 44...
## 8      1949 MULTIPOLYGON (((124599.3 44...
## 9      1163 MULTIPOLYGON (((124599.3 44...
## 10      639 MULTIPOLYGON (((124599.3 44...

For our data to effectively link to spatial data, we want there to be a single row for each county and one column for each variable e.g. we need to reshape the data from long format to wide format. We can do this using the pivot_wider command.

IL.Census <- spread(IL.Census,key = variable, value = estimate)
IL.Census
## Simple feature collection with 3123 features and 15 fields (with 2 geometries empty)
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 115638.5 ymin: 4093778 xmax: 457026.6 ymax: 4712668
## projected CRS:  NAD83 / UTM zone 16N
## First 10 features:
##          GEOID                                           NAME broadband
## 1  17001000202      Census Tract 2.02, Adams County, Illinois       739
## 2  17011965100     Census Tract 9651, Bureau County, Illinois      1278
## 3  17019000301  Census Tract 3.01, Champaign County, Illinois      1484
## 4  17019000700     Census Tract 7, Champaign County, Illinois       855
## 5  17019001203 Census Tract 12.03, Champaign County, Illinois      1704
## 6  17019005800    Census Tract 58, Champaign County, Illinois      1322
## 7  17029000400         Census Tract 4, Coles County, Illinois       740
## 8  17031010201     Census Tract 102.01, Cook County, Illinois      1909
## 9  17031030200        Census Tract 302, Cook County, Illinois      2039
## 10 17031031700        Census Tract 317, Cook County, Illinois      2710
##    for.born house.age housevalue income.gt75 med.gross.rent med.income
## 1        21      1949        721          92            639      42750
## 2       232      1964       1292         275            684      54128
## 3       783      1988         11          34            912       7166
## 4       512      1964        691          79            763      38036
## 5       324      1976       1594         675           1108      73463
## 6       741      1939        823         576            947      50160
## 7        36      1955        455          47            586      29548
## 8      2663      1945        781         401            989      40841
## 9       442      1939       1209         880           1052      64089
## 10      970      1939       1001        1198            914      44555
##    month.house.cost owneroccupied totalbeds tothouse totp workhome
## 1               679           721      1163     1163 2473       45
## 2               774          1292      1831     1831 3898       69
## 3               918            11      2277     2277 4740      103
## 4               724           691      1708     1708 3528       50
## 5              1135          1594      2033     2033 4788       93
## 6               973           823      1809     1809 4013       81
## 7               550           455      1434     1434 2593        7
## 8              1039           781      2994     2994 7197      150
## 9              1285          1209      2753     2753 5103      191
## 10             1101          1001      3835     3835 7001      305
##                          geometry
## 1  MULTIPOLYGON (((124599.3 44...
## 2  MULTIPOLYGON (((312861.2 45...
## 3  MULTIPOLYGON (((394176 4440...
## 4  MULTIPOLYGON (((392830.2 44...
## 5  MULTIPOLYGON (((389596 4439...
## 6  MULTIPOLYGON (((396222.7 44...
## 7  MULTIPOLYGON (((381928.3 43...
## 8  MULTIPOLYGON (((443315.8 46...
## 9  MULTIPOLYGON (((444458.8 46...
## 10 MULTIPOLYGON (((444692.4 46...
# If you are doing this on a table, not a spatial file, you want the pivot_wider command

Finally, remove any empty polygons as they mess up later code

IL.Census <- IL.Census[ ! st_is_empty( IL.Census ) , ]

8.2.3 Converting totals to percentages / densities

At the moment, we have the total number of people who make more than $75000 USD. Given there are a different number of people in each census tract, it would be better to look at the percentage of people who make more than $75000.

We can do this using the mutate() command. This allows us to do basic math on table data, in this case dividing the number of people who make more than $75K by the total population for each row/census tract.

#make a percentage of high earners column
IL.Census <- mutate(IL.Census, per.income.gt75 = income.gt75/totp)
#and a percentage of foreign born column
IL.Census <- mutate(IL.Census, per.for.born = for.born/totp)
#and a percentage of who work from home born column
IL.Census <- mutate(IL.Census, per.workhome = workhome/totp)

#and a percentage of owner occupied housing (DIVIDING BY TOTAL NUMBER OF HOUSES)
IL.Census <- mutate(IL.Census, per.owneroccupied = owneroccupied/tothouse)
#and a percentage with broadband
IL.Census <- mutate(IL.Census, per.broadband = broadband/tothouse)

IL.Census
## Simple feature collection with 3121 features and 20 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 115638.5 ymin: 4093778 xmax: 457026.6 ymax: 4712668
## projected CRS:  NAD83 / UTM zone 16N
## First 10 features:
##          GEOID                                           NAME broadband
## 1  17001000202      Census Tract 2.02, Adams County, Illinois       739
## 2  17011965100     Census Tract 9651, Bureau County, Illinois      1278
## 3  17019000301  Census Tract 3.01, Champaign County, Illinois      1484
## 4  17019000700     Census Tract 7, Champaign County, Illinois       855
## 5  17019001203 Census Tract 12.03, Champaign County, Illinois      1704
## 6  17019005800    Census Tract 58, Champaign County, Illinois      1322
## 7  17029000400         Census Tract 4, Coles County, Illinois       740
## 8  17031010201     Census Tract 102.01, Cook County, Illinois      1909
## 9  17031030200        Census Tract 302, Cook County, Illinois      2039
## 10 17031031700        Census Tract 317, Cook County, Illinois      2710
##    for.born house.age housevalue income.gt75 med.gross.rent med.income
## 1        21      1949        721          92            639      42750
## 2       232      1964       1292         275            684      54128
## 3       783      1988         11          34            912       7166
## 4       512      1964        691          79            763      38036
## 5       324      1976       1594         675           1108      73463
## 6       741      1939        823         576            947      50160
## 7        36      1955        455          47            586      29548
## 8      2663      1945        781         401            989      40841
## 9       442      1939       1209         880           1052      64089
## 10      970      1939       1001        1198            914      44555
##    month.house.cost owneroccupied totalbeds tothouse totp workhome
## 1               679           721      1163     1163 2473       45
## 2               774          1292      1831     1831 3898       69
## 3               918            11      2277     2277 4740      103
## 4               724           691      1708     1708 3528       50
## 5              1135          1594      2033     2033 4788       93
## 6               973           823      1809     1809 4013       81
## 7               550           455      1434     1434 2593        7
## 8              1039           781      2994     2994 7197      150
## 9              1285          1209      2753     2753 5103      191
## 10             1101          1001      3835     3835 7001      305
##                          geometry per.income.gt75 per.for.born per.workhome
## 1  MULTIPOLYGON (((124599.3 44...     0.037201779   0.00849171  0.018196522
## 2  MULTIPOLYGON (((312861.2 45...     0.070548999   0.05951770  0.017701385
## 3  MULTIPOLYGON (((394176 4440...     0.007172996   0.16518987  0.021729958
## 4  MULTIPOLYGON (((392830.2 44...     0.022392290   0.14512472  0.014172336
## 5  MULTIPOLYGON (((389596 4439...     0.140977444   0.06766917  0.019423559
## 6  MULTIPOLYGON (((396222.7 44...     0.143533516   0.18464989  0.020184401
## 7  MULTIPOLYGON (((381928.3 43...     0.018125723   0.01388353  0.002699576
## 8  MULTIPOLYGON (((443315.8 46...     0.055717660   0.37001528  0.020842018
## 9  MULTIPOLYGON (((444458.8 46...     0.172447580   0.08661572  0.037428963
## 10 MULTIPOLYGON (((444692.4 46...     0.171118412   0.13855164  0.043565205
##    per.owneroccupied per.broadband
## 1        0.619948409     0.6354256
## 2        0.705625341     0.6979792
## 3        0.004830918     0.6517347
## 4        0.404566745     0.5005855
## 5        0.784062961     0.8381702
## 6        0.454947485     0.7307905
## 7        0.317294282     0.5160391
## 8        0.260855043     0.6376086
## 9        0.439157283     0.7406466
## 10       0.261016949     0.7066493

We can also work out the population density. First we need the spatial area of each census tract. We can do this using the st_area() command, then use mutate again to add the population density column

area.m2 <- st_area(IL.Census)
# This is in metres^2. Convert to Km sq
area.km2 <- set_units(area.m2, "km^2")
area.km2 <- as.numeric(area.km2)
IL.Census <- mutate(IL.Census, pop.density = totp/area.km2)

# and the housing density
IL.Census <- mutate(IL.Census, house.density = tothouse/area.km2)

IL.Census
## Simple feature collection with 3121 features and 22 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 115638.5 ymin: 4093778 xmax: 457026.6 ymax: 4712668
## projected CRS:  NAD83 / UTM zone 16N
## First 10 features:
##          GEOID                                           NAME broadband
## 1  17001000202      Census Tract 2.02, Adams County, Illinois       739
## 2  17011965100     Census Tract 9651, Bureau County, Illinois      1278
## 3  17019000301  Census Tract 3.01, Champaign County, Illinois      1484
## 4  17019000700     Census Tract 7, Champaign County, Illinois       855
## 5  17019001203 Census Tract 12.03, Champaign County, Illinois      1704
## 6  17019005800    Census Tract 58, Champaign County, Illinois      1322
## 7  17029000400         Census Tract 4, Coles County, Illinois       740
## 8  17031010201     Census Tract 102.01, Cook County, Illinois      1909
## 9  17031030200        Census Tract 302, Cook County, Illinois      2039
## 10 17031031700        Census Tract 317, Cook County, Illinois      2710
##    for.born house.age housevalue income.gt75 med.gross.rent med.income
## 1        21      1949        721          92            639      42750
## 2       232      1964       1292         275            684      54128
## 3       783      1988         11          34            912       7166
## 4       512      1964        691          79            763      38036
## 5       324      1976       1594         675           1108      73463
## 6       741      1939        823         576            947      50160
## 7        36      1955        455          47            586      29548
## 8      2663      1945        781         401            989      40841
## 9       442      1939       1209         880           1052      64089
## 10      970      1939       1001        1198            914      44555
##    month.house.cost owneroccupied totalbeds tothouse totp workhome
## 1               679           721      1163     1163 2473       45
## 2               774          1292      1831     1831 3898       69
## 3               918            11      2277     2277 4740      103
## 4               724           691      1708     1708 3528       50
## 5              1135          1594      2033     2033 4788       93
## 6               973           823      1809     1809 4013       81
## 7               550           455      1434     1434 2593        7
## 8              1039           781      2994     2994 7197      150
## 9              1285          1209      2753     2753 5103      191
## 10             1101          1001      3835     3835 7001      305
##                          geometry per.income.gt75 per.for.born per.workhome
## 1  MULTIPOLYGON (((124599.3 44...     0.037201779   0.00849171  0.018196522
## 2  MULTIPOLYGON (((312861.2 45...     0.070548999   0.05951770  0.017701385
## 3  MULTIPOLYGON (((394176 4440...     0.007172996   0.16518987  0.021729958
## 4  MULTIPOLYGON (((392830.2 44...     0.022392290   0.14512472  0.014172336
## 5  MULTIPOLYGON (((389596 4439...     0.140977444   0.06766917  0.019423559
## 6  MULTIPOLYGON (((396222.7 44...     0.143533516   0.18464989  0.020184401
## 7  MULTIPOLYGON (((381928.3 43...     0.018125723   0.01388353  0.002699576
## 8  MULTIPOLYGON (((443315.8 46...     0.055717660   0.37001528  0.020842018
## 9  MULTIPOLYGON (((444458.8 46...     0.172447580   0.08661572  0.037428963
## 10 MULTIPOLYGON (((444692.4 46...     0.171118412   0.13855164  0.043565205
##    per.owneroccupied per.broadband pop.density house.density
## 1        0.619948409     0.6354256  1627.21822     765.24658
## 2        0.705625341     0.6979792    85.40026      40.11490
## 3        0.004830918     0.6517347 10392.74423    4992.46384
## 4        0.404566745     0.5005855  1361.80685     659.28744
## 5        0.784062961     0.8381702  1851.45227     786.13251
## 6        0.454947485     0.7307905  2772.57577    1249.83543
## 7        0.317294282     0.5160391    82.29480      45.51128
## 8        0.260855043     0.6376086 14339.72078    5965.41948
## 9        0.439157283     0.7406466  7713.76224    4161.47118
## 10       0.261016949     0.7066493 14691.20420    8047.53151

8.2.4 Plotting the data

OK, we now have a load of data loaded for Illinois, let’s take a look at it using tmap. For some reason, R is making me run each code chunk twice. If you see the warning, “The shape IL.Census contains empty units”, then just run the code chunk again. Here, I have looked at the two population

# create map 1
map1 <- tm_shape(IL.Census,  unit = "mi") +                      
           tm_polygons(col="totp",    
                       style="pretty",    
                       border.col = NULL,  
                       palette="YlGnBu",
                       title = "", # using the more sophisticated tm_layout command
                       legend.hist = TRUE)   +
  tm_scale_bar(breaks = c(0, 10, 20)) +
  tm_layout(main.title = "Total Population",  main.title.size = 0.95, frame = FALSE) +
  tm_layout(legend.outside = TRUE) 
   

map2 <- tm_shape(IL.Census, unit = "mi") + 
           tm_polygons(col = "pop.density", 
                       style = "quantile",
                       palette = "Reds", 
                       border.alpha = 0, 
                       title = "") +  # using the more sophisticated tm_layout command
  tm_scale_bar(breaks = c(0, 10, 20)) +
  tm_compass(type = "4star", position = c("left", "bottom")) + 
  tm_layout(main.title = "Population density",  main.title.size = 0.95, frame = FALSE) +
  tm_layout(legend.outside = TRUE) 

tmap_arrange(map1,map2)
## Compass not supported in view mode.

8.2.5 Zooming in

This map looks at all of Illinois. But we want to look at the area just around Chicago.

8.2.5.1 Cropping to a lat long box

There are two ways we could do this. The first way is to simply “draw” a new lat/long box for the data using the raster crop function. Because the data is in the UTM map projection, I worked out my new bounding box coordinates here: https://www.geoplaner.com/

I have commented out this code chunk, because for administrative data, there is often a better way (see below)

# My new region from https://www.geoplaner.com/
#Crop.Region <- as(extent(361464,470967,4552638,4701992), "SpatialPolygons")
# Make IL.census an sp format
#IL.Census.sp <- as(IL.Census,"Spatial")
# And set the projection to be the same as the Illinois data
#proj4string(Crop.Region) <- CRS(proj4string(IL.Census.sp))

# Subset the polygons to my new region
#IL.Census.sp <- crop(IL.Census.sp, Crop.Region, byid=TRUE)

# and convert back to sf
#IL.Census.Box <- st_as_sf(IL.Census.sp)

# Finally plot
#tm_shape(IL.Census.Box, unit = "mi") + 
#           tm_polygons(col = "pop.density", style = "quantile",
#                       palette = "Reds", border.alpha = 0, 
#                       title = "Population Density")+
#    tm_layout(legend.outside = TRUE) 

8.2.5.2 Loading city boundary data

Instead of a box, we might want to crop to administrative boundaries. We can download these using the Tigris package.

This has many thousands of boundary datasets that you can explore here. Tigris is a really powerful package. For a tutorial, see here [https://crd150.github.io/lab4.html#tigris_package] and here [https://github.com/walkerke/tigris]

For now, lets download the Chicago metropolitan area data and the city border

cb            <- core_based_statistical_areas(cb = TRUE, year=2017)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |                                                                      |   1%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=======                                                               |  11%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  16%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |=================================                                     |  48%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |=================================================                     |  71%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |========================================================              |  81%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |=============================================================         |  88%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |====================================================================  |  98%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================|  99%
  |                                                                            
  |======================================================================| 100%
Chicago.metro <- filter(cb, grepl("Chicago", NAME))
# and set the projection to be identical to the census data
Chicago.metro <- st_transform(Chicago.metro,crs(IL.Census))


#REMEMBER TO CHANGE THE STATE FOR YOUR CITY
pl           <- places(state = "IL", cb = TRUE, year=2017)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  16%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |=================================                                     |  48%
  |                                                                            
  |===================================                                   |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  64%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |========================================================              |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |======================================================================| 100%
Chicago.city <- filter(pl, NAME == "Chicago")
# and set the projection to be identical to the census data
Chicago.city <- st_transform(Chicago.city,crs(IL.Census))

8.2.5.3 Cropping to a city boundary

We can now crop our census data to this specific area using the ms_clip function.

# subset the illinois census data with the Chicago city limits
Chicago.city.Census <- ms_clip(target = IL.Census, clip = Chicago.city, remove_slivers = TRUE)

tm_shape(Chicago.city.Census, unit = "mi") + 
           tm_polygons(col = "pop.density", style = "quantile",
                       palette = "Reds", border.alpha = 0, 
                       title = "Population Density")+
    tm_shape(Chicago.city) +
           tm_borders()+
    tm_layout(legend.outside = TRUE) 

8.3 Challenge 1

Tutorial one downloaded census data using the tidycensus package, then used the tigris package to download city boundary file. We then did some data wrangling and subset the data to the city boundary.

  1. Our data comes from the American Community Survey collected by the census (https://www.census.gov/programs-surveys/acs/about.html). Describe this dataset. What is it? Where does it come from and how is it different from the census?

  2. Your job is now to choose a different city in the US and to repeat the process above. Remember to choose an appropriate map projection. [Hint, choose a large city with many census tracts & ideally WITHOUT a single “downtown outlier” like New York - if you are struggling to think of somewhere, try Philadelphia or New Orleans].

    1. Your code should be easy to read with comments in each code chunk. If I deleted the actual code in each chunk, I should still be able to understand what is going on.
    2. IN ADDITION to the census variables already included in the tutorial, I would like you to also include:
      • The Gini Index of income inequality: B19083_001
      • The total number of people who report having a bachelors degree: B06009_005
      • any other variables you are interested in (optional)
    3. As you work through the tutorial, I would like you to also convert the total number of people who report having a bachelors degree to a percentage of people per census tract that hold such a degree.
    4. I would like you to include a map of the Gini index as part of your output (clearly marked with a subheading)

8.4 Tutorial 2 Regression

8.4.1 Basics

Now we have some data, let’s do some regression analysis on it. Building on the lectures, first let’s look at the basics.

I would like to understand whether the percentage of people making more than $75,000 in each census tract in Chicago is influenced by some of the other demographic factors that I downloaded. First, I could look at the general correlation between each one of my variables (e.g. how well does a linear model fit the data).

For example, the correlation between the median income in each census tract and the percentage making > $75,000 can be obtained using the cor command.

cor(Chicago.city.Census$med.income,
    Chicago.city.Census$per.income.gt75,use="pairwise.complete.obs")
## [1] 0.8672256

Given that more highly paid people should bump up the median income, unsurprisingly there is a reasonable relationship between the two variables! In fact, the cor command can go even further and check the correlation between all the variables in our dataset that we care about:

# this will only work for columns containing numbers, so I am explicitly naming the ones I want
# the st_drop_geometry command makes it not spatial data for a second
corr<-cor(st_drop_geometry(Chicago.city.Census[,c("housevalue","pop.density","house.density",
                                                  "per.income.gt75","med.income","totalbeds",
                                                  "per.for.born","med.gross.rent","per.broadband",
                                                  "per.owneroccupied","per.workhome",
                                                  "house.age","month.house.cost")]),use="pairwise.complete.obs")

# Now we can make a cool plot of this, check out other method
corrplot(corr,method="number",number.cex=.5)

We can also look at the distribution of individual values for a given variable:

# GGplot fancy histogram

# %>% means push the output to the next line (or wherever the > 'points')
# + means add a subcommand
# R graph gallery

Chicago.city.Census %>%                  
   ggplot( aes(x=per.income.gt75)) +  
    geom_histogram(fill="#69b3a2") +
    ggtitle("Percentage with income > $75000") +
    theme_ipsum() +
    theme(plot.title = element_text(size=15))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite values (stat_bin).

#Also a summary
summary(Chicago.city.Census$per.income.gt75)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.03009 0.06601 0.10862 0.16237 0.54485       3

I notice that there are 3 missing values, so I will remove those as they will mess up the spatial regression analysis

# Convert to sp
Chicago.city.Census.sp <- as(Chicago.city.Census,"Spatial")

#Remove all rows containing any missing values. We'll lose data, but for now it will make everything work
# from the spatial eco package
Chicago.city.Census.sp <- sp.na.omit(Chicago.city.Census.sp, margin = 1)
## Deleting rows: 387476732764138227374691
#And convert back to sf
Chicago.city.Census <- st_as_sf(Chicago.city.Census.sp)

8.4.2 Basic regression

We can make a basic scatterplot using the plot command. Alternatively, we can make an interactive plot using ggplot. I want to explore whether people who make over $75000 are more likely to work from home(pre-COVID).

BUT! The ecological fallacy means we can’t fully do this. ALL WE CAN TEST IS: Do census tracts which contain more people who work from home ALSO contain more people who make > $75000.. The people who make more than $75K might be the ones who travel to work but just live in the same place.

# Make an interactive plot
# http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually

p <- Chicago.city.Census %>%                  
  ggplot( aes( per.income.gt75, per.workhome, label=NAME)) +
  geom_point() +
  theme_classic()+
  scale_color_gradient(low="blue", high="red")

ggplotly(p)

OK, we can see that there appears to be a positive relationship between the two, but a lot of spread. Now let’s make a linear fit using the OLS package. We can see that this has an R2 of 0.497 e.g. the model can explain ~50% of the variance (spread) seen in the data.

# using the OLS package. Pretty output but some other functions don't work.
fit1.ols <- ols_regress (per.workhome ~ per.income.gt75, data = Chicago.city.Census)
# and using base R
fit1.lm <- lm(per.workhome ~ per.income.gt75, data = Chicago.city.Census,na.action="na.exclude")

fit1.ols
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.705       RMSE                0.014 
## R-Squared               0.498       Coef. Var          63.614 
## Adj. R-Squared          0.497       MSE                 0.000 
## Pred R-Squared          0.494       MAE                 0.010 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                 ANOVA                                 
## ---------------------------------------------------------------------
##                Sum of                                                
##               Squares         DF    Mean Square       F         Sig. 
## ---------------------------------------------------------------------
## Regression      0.145          1          0.145    769.652    0.0000 
## Residual        0.147        777          0.000                      
## Total           0.292        778                                     
## ---------------------------------------------------------------------
## 
##                                    Parameter Estimates                                    
## -----------------------------------------------------------------------------------------
##           model     Beta    Std. Error    Std. Beta      t        Sig     lower    upper 
## -----------------------------------------------------------------------------------------
##     (Intercept)    0.008         0.001                 11.143    0.000    0.006    0.009 
## per.income.gt75    0.127         0.005        0.705    27.743    0.000    0.118    0.136 
## -----------------------------------------------------------------------------------------

Our model is now: percentage.workfromhome = 0.008+ 0.127xper.income.gt75

We can add a fit using abline (or check out R graph gallery):

plot(Chicago.city.Census$per.workhome ~ Chicago.city.Census$per.income.gt75,pch=16,cex=.5,col="blue")
abline(fit1.lm)

Now, lets see if adding a second variable makes the fit better. I’m guessing that census tracts where people work from home more often also have better broadband. Let’s have a look at the scatterplot of both variables:

# Make an interactive plot
# http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually

p <- Chicago.city.Census %>%                  
  ggplot( aes(per.income.gt75,per.workhome,col= per.broadband,label=NAME)) +
  geom_point() +
  theme_classic()+
  scale_color_gradient(low="blue", high="red")

ggplotly(p)

There seems to be some relationship. Let’s add this to the model and take a look:

# using the OLS package
fit2.ols <- ols_regress (per.workhome ~per.income.gt75 + per.broadband, data = Chicago.city.Census)
fit2.lm  <- lm (per.workhome ~per.income.gt75 + per.broadband, data = Chicago.city.Census,na.action="na.exclude")

fit2.ols
##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.707       RMSE                0.014 
## R-Squared               0.500       Coef. Var          63.529 
## Adj. R-Squared          0.498       MSE                 0.000 
## Pred R-Squared          0.495       MAE                 0.010 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                 ANOVA                                 
## ---------------------------------------------------------------------
##                Sum of                                                
##               Squares         DF    Mean Square       F         Sig. 
## ---------------------------------------------------------------------
## Regression      0.146          2          0.073    387.393    0.0000 
## Residual        0.146        776          0.000                      
## Total           0.292        778                                     
## ---------------------------------------------------------------------
## 
##                                    Parameter Estimates                                     
## ------------------------------------------------------------------------------------------
##           model     Beta    Std. Error    Std. Beta      t        Sig      lower    upper 
## ------------------------------------------------------------------------------------------
##     (Intercept)    0.004         0.002                  1.772    0.077     0.000    0.008 
## per.income.gt75    0.119         0.006        0.662    18.744    0.000     0.107    0.132 
##   per.broadband    0.007         0.004        0.062     1.754    0.080    -0.001    0.016 
## ------------------------------------------------------------------------------------------

So now the model is:

percentage.workfromhome = 0.004 + 0.119xper.income.gt75 + 0.008xper.broadband

The model improved! A little! We can see that the R2 went up to 0.498. But is this enough to be significant or meaningful? (it might be if there is enough data). One way to check is to compare the two models using ANOVA. This conducts a significance test to assess whether there is additional value to adding a new variable. In this case, the p-value is 0.07 - not super low=, so I would need a compelling reason to keep percentage broadband in the model.

anova(fit1.lm,fit2.lm)
## Analysis of Variance Table
## 
## Model 1: per.workhome ~ per.income.gt75
## Model 2: per.workhome ~ per.income.gt75 + per.broadband
##   Res.Df     RSS Df  Sum of Sq     F Pr(>F)  
## 1    777 0.14689                             
## 2    776 0.14631  1 0.00058013 3.077 0.0798 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also use AIC, where the LOWER number is typically the better fit (taking into account overfitting). In this case, we can see that it thinks Broadband is a useful variable to keep even if it has limited influence.

AIC(fit1.lm,fit2.lm)
##         df       AIC
## fit1.lm  3 -4464.082
## fit2.lm  4 -4465.164

So as it’s borderline and AIC suggests it is useful, I will keep it in the model.

8.4.3 Spatial residuals

One of the 4 assumptions around regression is that your data should be independent. We don’t want the residuals to have any influence or knowledge of each other. We know that census data is highly geographic, so let’s look at a MAP of the residuals (e.g. the distance from each point to the model line of best fit, high means the model underestimated the data, negative means the model overestimated the data).

To do, this we first add the residuals to our table

Chicago.city.Census$Fit2.Residuals <- residuals(fit2.lm)

Now let’s have a look! If we have fully explained all the data and the data is spatially independent then there should be no pattern.

We can look at the residuals directly:

# subset the illinois census data with the Chicago city limits
tm_shape(Chicago.city.Census, unit = "mi") + 
           tm_polygons(col = "Fit2.Residuals", style = "quantile",
                       palette = "-RdBu", border.alpha = 0, 
                       title = "Fit 2 residuals")+
    tm_shape(Chicago.city) +
           tm_borders()+
    tm_layout(legend.outside = TRUE) 
## Variable(s) "Fit2.Residuals" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Or.. we can look at the extreme residuals by converting to standard deviation.

Chicago.city.Census$sd_breaks <- scale(Chicago.city.Census$Fit2.Residuals)[,1]
# because scale is made for matrices, we just need to get the first column using [,1]

my_breaks <- c(-14,-3,-2,-1,1,2,3,14)

tm_shape(Chicago.city.Census) + 
  tm_fill("sd_breaks", title = "Residuals", style = "fixed", breaks = my_breaks, palette = "-RdBu") +
  tm_borders(alpha = 0.1) +
  tm_layout(main.title = "Residuals (Standard Deviation away from 0)", main.title.size = 0.7 ,
            legend.position = c("right", "bottom"), legend.title.size = 0.8)+
      tm_layout(legend.outside = TRUE) 
## legend.postion is used for plot mode. Use view.legend.position in tm_view to set the legend position in view mode.
## Variable(s) "sd_breaks" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

We have a problem! It is definitely not independent random noise - in fact there are little clusters of high residuals near the centre and other residuals around the edge. To test this, we can look at a Moran’s scatterplot with a queen’s spatial matrix. The test confirms highly significant autocorrelation. Our p-values and regression model coefficients cannot be trusted. so let’s try a spatial lag model.

# Moran.plot gets frustrated when the residuals are NA, for now I will just set them to zero
Chicago.city.Census$Fit2.Residuals[is.na(Chicago.city.Census$Fit2.Residuals)] <- 0
spatial.matrix.queen <-poly2nb(Chicago.city.Census, queen=T)

weights.queen <- nb2listw(spatial.matrix.queen, style='B',zero.policy=TRUE)

moran.plot(Chicago.city.Census$Fit2.Residuals, weights.queen,
           xlab = "Model residuals",
           ylab = "Neighbors residuals",zero.policy=TRUE)

moran.test(Chicago.city.Census$Fit2.Residuals, weights.queen,zero.policy=TRUE)
## 
## 	Moran I test under randomisation
## 
## data:  Chicago.city.Census$Fit2.Residuals  
## weights: weights.queen  n reduced by no-neighbour observations
##   
## 
## Moran I statistic standard deviate = 9.493, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.1863818244     -0.0012886598      0.0003908302

8.4.4 Spatial lag model

If the test is significant (as in this case), then we possibly need to think of a more suitable model to represent our data: a spatial regression model. Remember spatial dependence means that (more typically) there will be areas of spatial clustering for the residuals in our regression model. We want a better model that does not display any spatial clustering in the residuals.

There are two general ways of incorporating spatial dependence in a regression model:

  • A spatial error model
  • A spatial lagged model

The difference between these two models is both technical and conceptual. The spatial error model assumes that the:

“spatial dependence observed in our data does not reflect a truly spatial process, but merely the geographical clustering of the sources of the behavior of interest. For example, citizens in adjoining neighborhoods may favour the same (political) candidate not because they talk to their neighbors, but because citizens with similar incomes tend to cluster geographically, and income also predicts vote choice. Such spatial dependence can be termed attributional dependence” (Darmofal, 2015: 4)

The spatially lagged model, on the other hand, incorporates spatial dependence explicitly by adding a “spatially lagged” variable y on the right hand side of our regression equation. It assumes that spatial processes THEMSELVES are an important thing to model:

“If behavior is likely to be highly social in nature, and understanding the interactions between interdependent units is critical to understanding the behavior in question. For example, citizens may discuss politics across adjoining neighbors such that an increase in support for a candidate in one neighborhood directly leads to an increase in support for the candidate in adjoining neighborhoods” (Darmofal, 2015: 4)

Mathematically, it makes sense to run both models and see which fits best. We can do this using the lm.LMtests() function. (note, we are skipping over complexity here!). See here for more details on the full process: https://maczokni.github.io/crimemapping_textbook_bookdown/spatial-regression-models.html

But that goes beyond the scope of this course. Here we will try the spatial lag model, because I can imagine that things like broadband access have explicit spatial relationships (e.g. where the cable goes)

To fit a spatial lag model, we use

fit_2_lag <- lagsarlm(per.workhome ~per.income.gt75 + per.broadband, data = Chicago.city.Census, weights.queen,zero.policy=TRUE)
fit_2_lag
## 
## Call:
## lagsarlm(formula = per.workhome ~ per.income.gt75 + per.broadband, 
##     data = Chicago.city.Census, listw = weights.queen, zero.policy = TRUE)
## Type: lag 
## 
## Coefficients:
##             rho     (Intercept) per.income.gt75   per.broadband 
##     0.051875611     0.002631700     0.084832432     0.003733281 
## 
## Log likelihood: 2285.74

This is now going beyond the scope of this course, instead of a simple linear model, we are running a generalised additive model, which is mathematically more complex:

percentage.workfromhome = rho(WEIGHTS*percentage.workfromhome) + b0 + b1xper.income.gt75 + b2xper.broadband e.g.

percentage.workfromhome = 0.051(WEIGHTS*percentage.workfromhome) + 0.002 + 0.0846xper.income.gt75 + 0.004xper.broadband

You will notice that there is a new term Rho. What is this? This is our spatial lag. It is a variable that measures the percentage working from home in the census tracts SURROUNDING each tract of interest in our spatial weight matrix. We are simply using this variable as an additional explanatory variable to our model, so that we can appropriately take into account the spatial clustering detected by our Moran’s I test. You will notice that the estimated coefficient for this term is both positive and statistically significant. In other words, when the percentage working from home in surrounding areas increases, so does the percentage working from home in each country, even when we adjust for the other explanatory variables in our model.

Let’s use AIC to compare all 3 models.

AIC(fit1.lm,fit2.lm,fit_2_lag)
##           df       AIC
## fit1.lm    3 -4464.082
## fit2.lm    4 -4465.164
## fit_2_lag  5 -4561.481

We see that our new lagged version has the lowest AIC and so is likely to be the best model for predicting the percentage of people who work from home in each census tract.

Now, if we have fully taken into account the spatial autocorrelation of our data, the spatial residuals should show no pattern and no autocorrelation. Let’s take a look@

# Create the residuals
Chicago.city.Census$Fit2.LaggedResiduals <- residuals(fit_2_lag)

Chicago.city.Census$sd_breaks.lagged <- scale(Chicago.city.Census$Fit2.LaggedResiduals)[,1]
# because scale is made for matrices, we just need to get the first column using [,1]

#plot standard deviations
my_breaks <- c(-14,-3,-2,-1,1,2,3,14)
tm_shape(Chicago.city.Census) + 
  tm_fill("sd_breaks", title = "sd_breaks.lagged", style = "fixed", breaks = my_breaks, palette = "-RdBu",midpoint=0) +
  tm_borders(alpha = 0.1) +
  tm_layout(main.title = "Residuals for lagged model (Standard Deviation away from 0)", main.title.size = 0.7 ,
            legend.position = c("right", "bottom"), legend.title.size = 0.8)+
      tm_layout(legend.outside = TRUE) 
## legend.postion is used for plot mode. Use view.legend.position in tm_view to set the legend position in view mode.
#plot Moran's I
moran.plot(Chicago.city.Census$Fit2.LaggedResiduals, weights.queen,
           xlab = "Model residuals",
           ylab = "Neighbors residuals",zero.policy=TRUE)

moran.test(Chicago.city.Census$Fit2.LaggedResiduals, weights.queen,zero.policy=TRUE)
## 
## 	Moran I test under randomisation
## 
## data:  Chicago.city.Census$Fit2.LaggedResiduals  
## weights: weights.queen  n reduced by no-neighbour observations
##   
## 
## Moran I statistic standard deviate = -0.59971, p-value = 0.7257
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##     -0.0131409865     -0.0012886598      0.0003905928

Looking at the map, to the eye, there looks to be a spatial pattern, you can see that in the Moran test, the Moran’s I is no longer significant (the p-value is the likelihood this level of autocorrelation could have been seen by chance. A HIGH value of the p-value means that it is likely to have happened by chance, so it’s not significant).

So we have successfully included location in our model. If you still see a significant pattern at this stage, you could consider adjusting your spatial weights matrix (maybe a “neighbor” is a 2nd order queens, or the census tracks within 50km..).

8.4.5 Plotting the results

So we have a model!

Finally, we went to all the trouble to create a model predicting the percentage of people working from home - let’s look at the results.

Chicago.city.Census$Fit2.Lagged.Prediction <- predict(fit2.lm)

# subset the illinois census data with the Chicago city limits
map1 <- tm_shape(Chicago.city.Census, unit = "mi") + 
           tm_polygons(col = "Fit2.Lagged.Prediction", style = "quantile",
                       palette = "-Spectral", border.alpha = 0, 
                       title = "Prediction of the percentage of people working from home")+
    tm_shape(Chicago.city) +
           tm_borders()+
    tm_layout(legend.outside = TRUE) 

map1 <- tm_shape(Chicago.city.Census, unit = "mi") + 
           tm_polygons(col = "per.workhome", style = "quantile",
                       palette = "-Spectral", border.alpha = 0, 
                       title = "ACTUAL percentage of people working from home")+
    tm_shape(Chicago.city) +
           tm_borders()+
    tm_layout(legend.outside = TRUE) 

8.5 Challenge 2

Make a new sub-heading called Challenge 2. Below, answer these questions. Make sure they are clearly marked to make it easy to find and grade them.

  1. What are the 4 assumptions underpinning linear regression?
  2. If the data is spatially autocorrelated, which assumption is broken?
  3. What is the ecological fallacy and why is it important when looking at census data (hint, look at Lecture 4 and in this lab)
  4. If I tested 3 models using AIC and saw the values (Model1: -2403, Model2: -2746, Model3:-3102), which model would I be likely to choose?

Tutorial two applied regression analysis to our census data, focusing on the percentage of people working from home

Your job is now to apply this analysis to your City from part 1 and to repeat the process above. FIRST get it working to predict the percentage of people working from home.

Make sure to tell me what you are doing at each stage and to interpret your output. Maybe this relationship just exists for Chicago… Tell in a few sentences which model you chose, why and plot the predicted values.

BEFORE YOU START, READ THE SHOW ME SOMETHING NEW OPTIONS.

8.6 Lab 7 Show me something new

For 5/5 you can continue to do the classic “something new”, where you need to demonstrates the use of a function or package that was not specifically covered in the handout, lecture, or lab. Remember you actually have to do something new, not repeat what you did in previous weeks

OR

For 5/5 : Instead of predicting the percentage of people working from home, do your analysis predicting the median house value per census tract (housevalue). You should look at the list of variables shown in the corrplot, decide on some variables you think might predict house values, check the residuals for spatial dependence and try a spatial lag model. Then choose the best model as suggested by AIC.

8.7 Lab-7 submission check

For this lab, here is the mark breakdown:

HTML FILE SUBMISSION - 10 marks

RMD CODE SUBMISSION - 10 marks

WORKING CODE - 10 marks: Your code works and the output of each code chunk is included in the html file output (e.g. you pressed run-all before you finished)

EASY TO READ LAB SCRIPT - 10 marks: You have followed the style guide in section 7.1. You have commented your code chunks. Even if I had deleted the code itself, I would still be able to understand what you are doing in each code chunk from your comments.

CHALLENGE 1a - 5 marks: You thoughtfully describe the American Community Survey data and answer the questions

CHALLENGE 1b - 10 marks: You manage to download the ACS data for a new city, including the Gini index of income inequality and the PERCENTAGE of people with a bachelors degree. Your code makes sense, e.g. it doesn’t refer to chicago the whole time.

CHALLENGE 1c - 5 marks: You have professionally plotted the Gini data for your city.

CHALLENGE 2a - 8 marks: You have explained the 4 assumptions underpinning regression (using more than a single word for each one)

CHALLENGE 2b - 4 marks: You have explained which assumption is broken for spatial autocorrelation and why (you get partial marks for good reasoning even if you are not sure of the answer)

CHALLENGE 2c - 4 marks: You have described the ecological fallacy and why it’s important

CHALLENGE 2d - 4 marks: You have correctly answered the AIC question

CHALLENGE 2e - 15 marks: You have applied a regression analysis to your own data. Your code makes sense, e.g. it doesn’t refer to chicago the whole time and you have interpreted the output correctly.

SOMETHING NEW - 5 marks You demonstrated the use of a function or concept that was not specifically covered in the handout, lecture, or lab OR Your regression analysis focused on house value not the percentage of people working from home.

[100 marks total]